Determination of the Script and Language Content of Document Images

Author

  • A. Lawrence Spitz
Abstract

Most document recognition work to date has been performed on English text. Because of the large overlap of the character sets found in English and major Western European languages such as French and German, some extensions of the basic English capability to those languages have taken place. However, automatic language identification prior to optical character recognition is not commonly available and adds utility to such systems. Languages and their scripts have attributes that make it possible to determine the language of a document automatically. Detection of the values of these attributes requires the recognition of particular features of the document image and, in the case of languages using Latin-based symbols, the character syntax of the underlying language. We have developed techniques for distinguishing which language is represented in an image of text. This work is restricted to a small but important subset of the world’s languages. The method first classifies the script into two broad classes: Han-based and Latin-based. This classification is based on the spatial relationships of features related to the upward concavities in character structures. Language identification within the Han script class (Chinese, Japanese, Korean) is performed by analysis of the distribution of optical density in the text images. We handle 23 Latin-based languages using a technique based on character shape codes, a representation of Latin text that is inexpensive to compute.
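As a rough illustration of the character shape code idea mentioned in the abstract, the sketch below maps Latin characters to a handful of shape classes and counts the resulting word-shape tokens. The class table (`ASCENDERS`, `DESCENDERS`, and the codes `A`, `x`, `g`, `i`, `j`) and the function names are assumptions made for this example, not the paper's exact definitions. The sketch also works on plain text, whereas the described method derives the codes from the document image.

```python
from collections import Counter

# Hypothetical, simplified shape classes (assumed for illustration only).
ASCENDERS = set("bdfhklt") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") | set("0123456789")
DESCENDERS = set("gpqy")


def shape_code(ch):
    """Map one character to a coarse shape class."""
    if ch in ASCENDERS:
        return "A"  # rises above the x-height
    if ch in DESCENDERS:
        return "g"  # drops below the baseline
    if ch == "i":
        return "i"  # x-height body with a separate dot
    if ch == "j":
        return "j"  # descender with a separate dot
    return "x"      # everything else is treated as x-height


def word_shape(word):
    """Encode a word as a string of shape codes, ignoring punctuation."""
    return "".join(shape_code(c) for c in word if c.isalnum())


def shape_profile(text):
    """Count word-shape tokens in whitespace-separated text."""
    shapes = (word_shape(w) for w in text.split())
    return Counter(s for s in shapes if s)


if __name__ == "__main__":
    for token, count in shape_profile("the cat and the dog").most_common():
        print(token, count)
```

Because very frequent short words differ between languages, a profile of common word-shape tokens could in principle be compared against per-language reference profiles. That comparison step is not shown here and would need reference data from real text.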


Similar resources

An Analysis of Ministry of Education’s Strategic Plans Based on Favorable Components of English Language Teaching Using Shannon’s Entropy

The present research aims to analyze the content of Ministry of Education’s strategic plans (the Fundamental Reform Document of Education, the Comprehensive National Scientific Plan and the National Curriculum Document) based on Shannon's entropy regarding the favorable components of teaching English. The contents of the Fundamental Reform Document of Education, the Comprehensive National Scien...


Adaptive Algorithms for Automated Processing

Title of dissertation: ADAPTIVE ALGORITHMS FOR AUTOMATED PROCESSING OF DOCUMENT IMAGES Mudit Agrawal, Doctor of Philosophy, 2011 Dissertation directed by: Professor Larry Davis Department of Computer Science Dr. David Doermann University of Maryland Institute for Advanced Computer Studies ABSTRACT Large scale document digitization projects continue to motivate interesting document understanding...


WandaML a markup language for digital document annotation

WandaML is an XML-based markup language for the annotation and filter journaling of digital documents. It addresses in particular the needs of forensic handwriting data examination, by allowing experts to enter information about writer, material (pen, paper), script and content, and to record chains of image filtering and feature extraction operations applied to the data. We present the design ...


Hierarchical Content Classification and Script Determination for Automatic Document Image Processing

Page segmentation and image content classification play an important role in automatic image processing with applications to mixed-type document image compression, form and check reading, and automatic mail sorting. In this paper, we first present an enhanced background thinning based approach for fast page segmentation. After the analysis of three different methods individually, a hierarchical app...


The Effectiveness of Shadow-Reading With and Without Written Script on Listening Comprehension of Iranian Intermediate EFL Students.

Listening comprehension is at the heart of language learning (Kurita, 2012). It is an important language skill to develop in terms of second language acquisition (SLA) (Dunkel, 1991; Rost, 2001; Vandergrift, 2007). In spite of its importance, L2 learners often regard listening as the most difficult language skill to learn. In this study, shadowing as an act or task in listening, in which the learner...



Journal:
  • IEEE Trans. Pattern Anal. Mach. Intell.

Volume 19

Publication date: 1997